Object recognition and computer vision 2021/2022

Jean Ponce, Ivan Laptev, Cordelia Schmid and Josef Sivic

Assignment 1: Instance-level recognition

Adapted from practicals from Andrea Vedaldi and Andrew Zisserman
by Gul Varol and Ignacio Rocco


STUDENT: GHAMGUI Eya

EMAIL: eya.ghamgui@telecom-paris.fr

Guidelines

The purpose of this assignment is that you get hands-on experience with the topics covered in class, which will help you understand these topics better. Therefore, it is imperative that you do this assignment yourself. No code sharing will be tolerated.

Once you have completed the assignment, you will submit the ipynb file containing both code and results. For this, make sure to run your notebook completely before submitting.

The ipynb must be named using the following format: A1_LASTNAME_Firstname.ipynb, and submitted in the class Moodle page.

Goal

The goal of instance-level recognition is to match (recognize) a specific object or scene. Examples include recognizing a specific building, such as Notre Dame, or a specific painting, such as 'Starry Night' by Van Gogh. The object is recognized despite changes in scale, camera viewpoint, illumination conditions and partial occlusion. An important application is image retrieval - starting from an image of an object of interest (the query), search through an image dataset to obtain (or retrieve) those images that contain the target object.

The goal of this assignment is to experiment and get basic practical experience with the methods that enable specific object recognition. It includes: (i) using SIFT features to obtain sparse matches between two images; (ii) using similarity co-variant detectors to cover changes in viewpoint; (iii) vector quantizing the SIFT descriptors into visual words to enable large scale retrieval; and (iv) constructing and using an image retrieval system to identify objects.

Setup environment

Download and install CyVLFeat

Imports

Download and uncompress data

Part 1: Sparse features for matching specific objects in images

Feature point detection

The SIFT feature has both a detector and a descriptor. The detector used in SIFT corresponds to the "difference of Gaussians" (DoG) detector, which is an approximation of the "Laplacian of Gaussian" (LoG) detector.

We will start by computing and visualizing the SIFT feature detections (usually called frames) for two images of the same object (a building facade). Load an image, rotate and scale it, and then display the original and transformed pair:

A SIFT frame is a circle with an orientation and is specified by four parameters: the center $x$, $y$, the scale $s$, and the rotation $\theta$ (in radians), resulting in a vector of four parameters $(x, y, s, \theta)$.

Now compute and visualise the SIFT feature detections (frames):

Examine the second image and its rotated and scaled version and convince yourself that the detections overlap the same scene regions (even though the circles' positions have moved in the image and their radii have changed). This demonstrates that the detection process is co-variant (or equi-variant) with translations, rotations and isotropic scalings. This class of transformations is known as a similarity or equiform.

:: TASK 1.1 ::

Now repeat the exercise with a pair of natural images.

Start by loading the second one: data/oxbuild_lite/all_souls_000015.jpg

Plot the images and feature frames. Do not apply a synthetic rotation. Again you should see that many of the detections overlap the same scene region. Note that, while repeatability occurs for the pair of natural views, it is much better for the synthetically rotated pair.

#

Comments:

From the previous plots, we notice that the descriptors detected in the second natural image yield fewer matches than those detected in the synthetically rotated image. I think this is related to the differences between the two natural images: they were not taken from the same camera position. In addition, the change in the color of the sky creates mismatches. The reflection of the sun on the building also affects the selection of descriptors by changing the illumination of pixels and casting a shadow on the left part of the building. This shadow creates an occlusion effect: as we can see in that area, the number of descriptors has decreased.

#

The number of detected features can be controlled by changing the peakThreshold option. A larger value will select features that correspond to higher contrast structures in the image.

:: TASK 1.2 ::

For the same image, produce 3 sub-figures with different values of peakThreshold. Comment.

################

Answer:

SIFT detection is controlled by the peak threshold parameter. The peak selection threshold filters out peaks of the DoG scale space whose response is below the threshold (in absolute value).

From the previous results, we notice that with a low threshold value, a high number of descriptors is kept in the image. Indeed, when the threshold is equal to 0.005, many descriptors are retained, even non-relevant ones. By increasing its value, more descriptors are eliminated and only those considered as points of interest are kept. However, with a very high threshold value, a large number of descriptors is eliminated, as can be seen in the last image where the threshold is equal to 0.07.

$\Longrightarrow$ The threshold value should be chosen carefully to keep a suitable number of descriptors, because this has a direct impact on the matching step: the right threshold value helps to obtain good matches. Several methods have been developed to choose an appropriate threshold, for example automatic thresholding.

#

Feature point description and matching

Introduction to feature descriptors

The parameters $(t_x, t_y, s, \theta)$ of the detected frames can be used to extract a scaled and oriented RGB patch around $(t_x, t_y)$, which is used to describe the feature point.

The simplest possible descriptor would be to (i) resize these patches to a common size (e.g. 30x30) and (ii) flatten them to a vector. However, in practice we use more sophisticated descriptors such as SIFT, which is based on histograms of gradient orientations.
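This naive baseline can be sketched in a few lines of numpy. The helper name and the nearest-neighbour resizing strategy below are my own illustration, not part of the assignment code:

```python
import numpy as np

def naive_patch_descriptor(patch, out_size=30):
    """Resize a square patch to out_size x out_size by nearest-neighbour
    sampling, then flatten it into a 1-D vector (hypothetical helper)."""
    h, w = patch.shape[:2]
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    resized = patch[np.ix_(rows, cols)]       # keeps the color channels
    return resized.reshape(-1).astype(np.float64)

# A 60x60 RGB patch flattens to a 30*30*3 = 2700-dimensional vector
patch = np.random.rand(60, 60, 3)
desc = naive_patch_descriptor(patch)
```

Even this toy version shows the drawback: the descriptor dimension (2700) is much larger than SIFT's 128, and raw pixel values offer no invariance.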

:: TASK 1.3 ::

What is the interest of using SIFT descriptors over these flattened RGB patches?

################

Answer:

The SIFT descriptor is very useful for image comparison and object recognition. The advantage of using this descriptor rather than flattened RGB patches is that it determines features that are invariant to rotations and scale transformations in the image domain and robust to moderate perspective transformations and illumination variations, which is not the case for flattened RGB patches.

In fact, the SIFT descriptor is invariant to scale transformations because the method uses images at different scales, then, it maximizes the Difference of Gaussian (DoG) in scale and space to find the same key points independently in each image. In addition, it is invariant to rotation because this method computes the histogram of gradient directions to identify the most dominant direction. Using this information, it becomes easy to compare two key points. Furthermore, these descriptors are less sensitive to illumination change because they are extracted from grayscale and normalized patches.

On top of that, these features are lighter than flattened RGB patches in terms of memory and computational time. The SIFT descriptor is a vector of length 128, representing an array of 4x4 histograms with 8 orientation bins per histogram, whereas flattened 30x30 RGB patches are 2700-dimensional vectors (30x30x3).

#

Descriptor matching

SIFT descriptors are 128-dimensional vectors, and can be directly matched to find correspondences between images. We will start with the simplest matching scheme (first nearest neighbour of descriptors in terms of Euclidean distance) and then add more sophisticated methods to eliminate any mismatches.

Be careful: the descriptor output by cyvlfeat.sift.sift is an array of integers. To avoid issues when subtracting descriptors, convert it to an array of floats first with descriptor = descriptor.astype(np.float64).

In the following section, use a peak threshold of 0.01.

:: TASK 1.4 ::

For each descriptor in im1, assign a matching descriptor in im2 by picking its first nearest neighbour.

Populate the second column of the matches vector.
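A minimal numpy sketch of this first-nearest-neighbour assignment, on made-up 2-D toy descriptors (real SIFT descriptors are 128-D and should first be cast with `.astype(np.float64)`):

```python
import numpy as np

# Toy descriptors standing in for the SIFT outputs (hypothetical data)
desc1 = np.array([[0., 0.], [5., 5.]])            # descriptors of im1
desc2 = np.array([[5., 4.], [0., 1.], [9., 9.]])  # descriptors of im2

# Pairwise Euclidean distances: dists[i, j] = ||desc1[i] - desc2[j]||
dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)

# For each descriptor of im1, pick its first nearest neighbour in im2
matches = np.stack([np.arange(len(desc1)), dists.argmin(axis=1)], axis=1)
```

The broadcasting trick computes the full distance matrix in one shot; for the real 128-D descriptors the same two lines apply unchanged.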

#

Comments:

From the previous plot, we notice that there are many mismatches between the descriptors. For example, some descriptors in the sky are linked to ones on the building. Thus, we should improve the matching by eliminating those wrong matches, for instance by checking spatial consistency.

#

Improving SIFT matching (i) using Lowe’s second nearest neighbour test

Lowe introduced a second nearest neighbour (2nd NN) test to identify, and hence remove, ambiguous matches. The idea is to identify distinctive matches by a threshold on the ratio of first to second NN distances.

The ratio is: $$NN_{ratio} = \frac{1^{st}\text{NN distance}}{2^{nd}\text{NN distance}}.$$

:: TASK 1.5 ::

For each descriptor in im1, compute the ratio between the first and second nearest neighbour distances.

Populate the ratio vector.

The ratio vector will now be used to retain only the matches whose ratio is below a given threshold.

A value of $NN_{threshold} = 0.8$ is often a good compromise between losing too many matches and rejecting mismatches.

#

Comments:

Using Lowe’s second nearest neighbour test, we succeeded in eliminating a large number of mismatches. However, this method does not provide perfect matches; for example, the descriptor in the grass is still mismatched. Therefore, we need to try other methods to further improve the matching.

#

Improving SIFT matching (ii) by geometric verification

In addition to the 2nd NN test, we can also require consistency between the matches and a geometric transformation between the images. For the moment we will look for matches that are consistent with a similarity transformation:

$$\begin{bmatrix} x_2 \\ y_2 \end{bmatrix} = sR(\theta) \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} $$

which consists of a rotation by $\theta$, an isotropic scaling (i.e. same in all directions) by s, and a translation by a vector $(t_x, t_y)$. This transformation is specified by four parameters $(s,\theta,t_x,t_y)$ and can be computed from a single correspondence between SIFT detections in each image.

:: TASK 1.6 ::

Given a detected feature with parameters $(x_1, y_1, s_1, \theta_1)$ in image $1$ matching a feature $(x_2, y_2, s_2, \theta_2)$ in image $2$, work out how to find out the parameters $(t_x,t_y,s,\theta)$ of the transformation mapping points from image $1$ to image $2$.


$$\theta = \theta_2 - \theta_1, \qquad s = \frac{s_2}{s_1}$$

$$\begin{bmatrix} t_x \\ t_y \end{bmatrix} = \begin{bmatrix} x_2 \\ y_2 \end{bmatrix} - \frac{s_2}{s_1}\, R(\theta_2 - \theta_1) \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} x_2 - \frac{s_2}{s_1}\left(x_1\cos(\theta_2 - \theta_1) - y_1\sin(\theta_2 - \theta_1)\right) \\ y_2 - \frac{s_2}{s_1}\left(x_1\sin(\theta_2 - \theta_1) + y_1\cos(\theta_2 - \theta_1)\right) \end{bmatrix}$$
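These formulas translate directly into code. The sketch below recovers the four parameters from a single synthetic correspondence (a rotation by 90° and a scaling by 2, values made up for the sanity check):

```python
import numpy as np

def similarity_from_match(f1, f2):
    """Recover (tx, ty, s, theta) mapping frame f1 = (x1, y1, s1, th1)
    in image 1 onto frame f2 = (x2, y2, s2, th2) in image 2."""
    x1, y1, s1, th1 = f1
    x2, y2, s2, th2 = f2
    theta = th2 - th1
    s = s2 / s1
    c, si = np.cos(theta), np.sin(theta)
    tx = x2 - s * (c * x1 - si * y1)   # translation after rotating/scaling
    ty = y2 - s * (si * x1 + c * y1)
    return tx, ty, s, theta

# Synthetic correspondence: point (1, 0) should land on (3, 2)
f1 = (1.0, 0.0, 1.0, 0.0)
f2 = (3.0, 2.0, 2.0, np.pi / 2)
tx, ty, s, theta = similarity_from_match(f1, f2)
```

Applying the recovered transformation $sR(\theta)(x_1, y_1)^\top + (t_x, t_y)^\top$ to the frame center of image 1 indeed lands on the center in image 2, which confirms the derivation.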

The matches consistent with a similarity can then be found using the RANSAC algorithm, described by the following steps:

For each tentative correspondence in turn:

  1. compute the similarity transformation $(s, \theta, t_x, t_y)$ from that single correspondence;
  2. count how many of the other matches are consistent with this transformation (the inliers).

Finally, choose the transformation with the highest count of inliers

After this algorithm the inliers are consistent with the transformation and are retained, and most mismatches should now be removed.
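A minimal sketch of this single-correspondence RANSAC, with a made-up inlier tolerance and toy frames (the function name and test data are illustrative, not the assignment's reference implementation):

```python
import numpy as np

def ransac_similarity(frames1, frames2, matches, tol=3.0):
    """Hypothesise a similarity from each single correspondence and keep
    the hypothesis with the most inliers. frames are (x, y, s, theta)."""
    best_inliers = np.zeros(len(matches), dtype=bool)
    pts1 = frames1[matches[:, 0], :2]
    pts2 = frames2[matches[:, 1], :2]
    for i, j in matches:
        x1, y1, s1, t1 = frames1[i]
        x2, y2, s2, t2 = frames2[j]
        s, th = s2 / s1, t2 - t1
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        t = np.array([x2, y2]) - s * R @ np.array([x1, y1])
        # Project all matched im1 points and measure residuals in im2
        proj = (s * (R @ pts1.T)).T + t
        inliers = np.linalg.norm(proj - pts2, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# Toy data: a pure translation by (2, 3) plus one gross outlier match
frames1 = np.array([[0., 0., 1., 0.], [10., 0., 1., 0.],
                    [0., 10., 1., 0.], [5., 5., 1., 0.]])
frames2 = frames1 + np.array([2., 3., 0., 0.])
frames2[3, :2] = [40., 40.]
matches = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
inliers = ransac_similarity(frames1, frames2, matches)
```

The outlier match is rejected because no single-correspondence hypothesis can explain it together with the three consistent ones.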

:: TASK 1.7 ::

#

Comments:

After applying the geometric verification method, we find that all mismatches have been successfully eliminated. The remaining descriptors are well matched and describe the points of interest in the images. Indeed, RANSAC is a good method to determine matches between images, as it is robust to outliers.

#

Part 2: Compact descriptors for image retrieval

In large scale retrieval the goal is to match a query image to a large database of images (for example the WWW or Wikipedia).

The quality of an image match is measured as the number of geometrically verified feature correspondences between the query and a database image. While the techniques discussed in Part 1 are sufficient to do this, in practice they require too much memory to store the SIFT descriptors for all the detections in all the database images.

In this part we will see how we can compute a global image descriptor from the set of SIFT descriptors using the bag-of-visual-words (BoVW) approach.

Then, we will see how these global descriptors can be used to rapidly retrieve a shortlist of candidate database images given a query image. Finally, we will see how to re-rank the shortlist of candidates using a geometric verification technique that requires only the detector frames and their assigned visual word indices; remember the SIFT descriptors are only used to compute the compact BoVW descriptors and then discarded.

Download preprocessed dataset of paintings

Load preprocessed dataset of paintings

We will now load the preprocessed dataset of paintings. The construction of this dataset has involved several steps.

  1. SIFT features were extracted from all paintings in the dataset
  2. A global vocabulary of SIFT descriptors was computed using K-means clustering. These are the visual words of our dataset.
  3. The SIFT features for each painting were assigned to the nearest word, and a compact descriptor was generated for each painting. This compact descriptor consists of the normalized histogram of words. The histogram normalization itself involves 3 different steps:
    • a) TF-IDF weighting: each word is re-weighted according to its TF-IDF value. This down-weights words that are very common and therefore not very descriptive.
    • b) Square-rooting: each element is square-rooted
    • c) L2-normalization: The whole histogram is L2-normalized.
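The three normalization steps above can be sketched as follows. The vocabulary size, word assignments, and IDF values below are made up for illustration; in the real pipeline the IDF weights come from document frequencies over the whole database:

```python
import numpy as np

# Hypothetical toy setup: 4 visual words, assignments for one image
num_words = 4
word_ids = np.array([0, 0, 2, 3, 3, 3])

# Raw histogram of visual-word occurrences (term frequencies)
hist = np.bincount(word_ids, minlength=num_words).astype(np.float64)

# a) TF-IDF weighting (these idf values are invented for the example)
idf = np.array([0.1, 2.0, 1.0, 0.5])
hist = hist * idf

# b) Square-rooting each element
hist = np.sqrt(hist)

# c) L2-normalization of the whole histogram
hist = hist / np.linalg.norm(hist)
```

After step (c) the descriptor has unit norm, so the matching score between two images reduces to a dot product of their histograms.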

:: TASK 2.1 ::

Construct a KDTree of the vocabulary for fast NN search. Then use the KDTree to find the closest word in the vocabulary to each descriptor of the query image.
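A minimal sketch of this step using `scipy.spatial.cKDTree`, on a tiny made-up 2-D vocabulary (the real visual words are 128-D SIFT cluster centers):

```python
import numpy as np
from scipy.spatial import cKDTree

# Tiny invented vocabulary of "visual words"
vocabulary = np.array([[0., 0.], [10., 0.], [0., 10.]])
tree = cKDTree(vocabulary)

# Each query descriptor is assigned the index of its nearest word
descriptors = np.array([[1., 1.], [9., 1.], [1., 9.]])
_, word_ids = tree.query(descriptors)
```

`tree.query` returns both the distances and the indices; only the indices are needed to build the bag-of-visual-words histogram.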

:: TASK 2.2 ::

Compute the compact BoVW descriptor of the query image.

:: TASK 2.3 ::

Compute the matching score with each image from the database.

#

Comments:

To compute the matching score between the two histograms, I chose the normalized scalar product method described in the lecture notes. This method calculates the similarity of two histograms, with values lying between 0 (no overlap) and 1 (identical direction). The higher the score, the better the match.

There are also other methods for this task, such as the Euclidean distance, histogram intersection, and the Jaccard distance. These metrics likewise measure the similarity (or dissimilarity) between two histograms.
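The normalized scalar product amounts to a cosine similarity between the two histograms; a minimal sketch on invented values:

```python
import numpy as np

def similarity_score(h1, h2):
    """Normalized scalar product of two histograms:
    0 = no overlap, 1 = identical direction."""
    return float(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2)))

# Two proportional histograms (same direction) score exactly 1
h_query = np.array([1.0, 0.0, 2.0])
h_db    = np.array([2.0, 0.0, 4.0])
score = similarity_score(h_query, h_db)
```

Since the BoVW descriptors are already L2-normalized, in the actual retrieval step the score simplifies to a plain dot product over the database, which is a single matrix-vector multiplication.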

#
#

Comments:

From the previous results, we notice that the two best scores give two correct matches.

#

AUTHORSHIP STATEMENT

I declare that the preceding work was the sole result of my own effort and that I have not used any code or results from third-parties.

GHAMGUI Eya